ABSTRACT

The aspect of the Unified Science and Mathematics for
Elementary Schools (USMES) project described in this paper was 
undertaken in an effort to observe the problem solving behaviors of 
elementary school children. The Notebook Problem consisted of
presenting a student with three notebooks arranged so as to differ 
from each other in terms of such dimensions as number of pages, 
number of lines per page, binding, price, etc.; the subject was asked 
to (1) select the most appropriate one for his class, and (2) 
indicate the reasons for his selection. Pretests and posttests were 
administered to randomly selected students from both control and 
USMES project groups. Scoring of responses was performed along the
following lines: (1) whether any of the subject's reasons for
selection were stated in measurable quantities, and (2) the highest 
level of warrant associated with the reasons stated. The measurable
dimensions included (1) size-volume, (2) weight, (3) cost, etc., while
the level of warrant was determined by whether a response was (1)
personal opinion, (2) testable, or (3) had been tested.
Chi-square analysis revealed significant improvement in pretest to
posttest scores for the experimental group versus the control group.
(Author/CP) 
Introduction

The evaluation of the Unified Science and Mathematics for Elementary
Schools (USMES) program in 1971-1972 encompassed a number of different
strategic approaches in developmental (first stage), implementation
(second stage), and control (comparison) classes at various grade levels.
Included in the range of evaluation techniques were teacher logs, classroom 
observations, and standardized tests. In addition to the data yielded by 
these approaches, however, there was a desire to observe the problem solving
behavior of elementary school children in a situation which was standardized 
and structured but which provided the subjects an opportunity to consider 
and test hypotheses with concrete materials. 

In order to accomplish this purpose, the Notebook Problem was devised. 
It consisted essentially of presenting the testee with three notebooks selected
so as to differ from each other in terms of such dimensions as number of pages, 
number of lines per page, binding, price, space between lines, width of ruled 
margin, etc., and asking the testee to (a) select the most appropriate
one for his class, and (b) indicate the reasons for his selection. The test 
was designed to be administered individually, and the precise directions given 
by the tester to the testee were as follows: 

Suppose (insert Principal's name) decided that
all the (insert testee's grade level) grade
should have notebooks to keep their science
and math work in. He writes a notebook company
and asks for samples. They send him three
(point to each notebook) notebooks. He (she)
comes to you and says, "(insert testee's name),
I need your help." "Which of these books would
be the best for the (insert testee's grade level)
grade to keep their science and math work in?"
Procedure

The administration of the notebook problem test was limited to USMES 
implementation classes and their corresponding control groupings. The 
initial sample consisted of seventeen USMES and seventeen control schools 
with forty-three experimental and thirty-one control classes involved. In 
both categories classes in grades two through six were represented as were 
all current USMES implementation units. The test was to be administered
at both the beginning (pretest) and end (posttest) of the school year, but,
given the necessity for individual administration, it was impracticable to 
administer to all members of each class unit. Therefore, testers (who were 
the classroom observers already being used for USMES activities) were 
asked to randomly select five pupils from each of the designated classrooms 
as testees. The pretest and posttest selections were made independently
since it was felt that practice effects would be both large and uncontrolled 
in this kind of test setting. 
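
The selection procedure described above can be sketched in a few lines of Python. This is only an illustration of drawing five pupils at random from a class, with the pretest and posttest draws made independently; the roster, the function name `draw_testees`, and the seeds are hypothetical and do not come from the original study.

```python
import random


def draw_testees(roster, k=5, seed=None):
    """Draw up to k pupils at random, without replacement, from a class roster."""
    rng = random.Random(seed)
    # If fewer than k pupils are available, take the whole roster.
    return rng.sample(roster, min(k, len(roster)))


# Hypothetical class of 25 pupils (illustrative names only).
roster = [f"pupil_{i:02d}" for i in range(1, 26)]

# Two separate draws: the posttest group is chosen without reference
# to who was tested at pretest time.
pretest_group = draw_testees(roster, seed=1)
posttest_group = draw_testees(roster, seed=2)
```

Because the two draws are independent, a pupil may appear in both groups or in only one, which is consistent with treating the pretest and posttest samples as separate.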

Testees were taken from the classroom for the test administration.
Each tester was asked to allow as much time as necessary for the testee
to complete his or her work on the problem and to encourage as full a response
as possible. Specific tools such as paper, pencil, and ruler were available 
but not pressed upon the testee. 

Results

(a) Sample 

Administration of the notebook test proved to be somewhat more difficult 
than originally envisioned. A number of particular administrative problems 
arose (e.g., availability of testees for testing, time of testing, school
schedules, etc.), which made it impossible for the full quota of five pupils
per classroom unit to be met. At the end of the school year pretest and
posttest results were available for seven school districts (grades two through
six), the total sample comprising thirty-one USMES classes and twenty-two
controls. The number of pupils tested in each class varied from two to six,
and the final sample consisted of two hundred and twenty-seven pretests (one
hundred and thirty-two USMES, ninety-five controls) and two hundred and
forty-six posttests (one hundred and forty-four USMES, one hundred and six controls).
Seven of the USMES teachers were male and twenty-four were female. Their 
variation in terms of teaching experience was quite marked. The range was from 
one to twenty years with about twenty-five percent of the thirty-one USMES 
teachers having less than three years of classroom experience. The demographic
variation of the seven school districts was also considerable. Three were at
the center of large urban areas with largely lower and lower middle class
pupils; three were in suburban areas with largely middle and upper class pupils;
and one was a rural area with a student group of mixed socio-economic
backgrounds.

(b) Test Data 
(i) Scoring 

All testers transcribed verbatim the testees' responses. These responses 
were then typed in preparation for scoring of the protocols. After examination 
of several pilot protocols, a rather elaborate set of response categories had 
been developed (cf. Appendix C). In examining the actual data, however, it
appeared that there was not, in general, enough variability in the subject
responses to make all of the sub-categories operative. In addition, there were
some aspects of programmatic concern (e.g., whether or not subjects resorted to
direct measurement in their problem solving) which were not being addressed.
Therefore, a simpler but more direct schema was developed. In this new
approach each protocol was to be assessed in terms of:

(a) whether or not any of the subject's reasons for selection were 
stated along dimensions that were measurable within the test
situation, and 

(b) the highest level of warrant associated with the given reasons for 
selection.

The dimensions measurable within the test situation were (i) size-volume 
(e.g., "bigger sheets than the small one," etc.), (ii) weight (e.g., "heavier,"
etc.), (iii) quantity (e.g., "more sheets than the large one," "more lines per
page," etc.), and (iv) cost (e.g., "doesn't cost as much as the big one," "costs
less for number of sheets," etc.). Three categories were developed for level
of warrant. These were (i) the reason given was expressed simply as a personal
opinion, (ii) a test was suggested to assess the reason given, and (iii) a
test was actually performed to test the reason given. These levels were
considered a hierarchy in increasing order of appropriateness, and each protocol
was assigned to the highest level present among the several that an individual
subject may have used.
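
The two-part scoring rule can be illustrated with a minimal Python sketch. It assumes each protocol has already been broken into individual reasons, each tagged with a dimension and a warrant level. The names `Warrant`, `Reason`, and `score_protocol` are hypothetical, and treating an empty protocol as personal opinion is an assumption, not something the report specifies.

```python
from dataclasses import dataclass
from enum import IntEnum


class Warrant(IntEnum):
    """Levels of warrant, ordered as the hierarchy described in the text."""
    OPINION = 1
    SUGGESTED_TEST = 2
    ACTUAL_TEST = 3


# The four dimensions measurable within the test situation.
MEASURABLE = {"size-volume", "weight", "quantity", "cost"}


@dataclass
class Reason:
    dimension: str
    warrant: Warrant


def score_protocol(reasons):
    """Return (any measurable reason?, highest warrant level present)."""
    uses_measurable = any(r.dimension in MEASURABLE for r in reasons)
    # Assumption: a protocol with no codable reasons defaults to OPINION.
    highest = max((r.warrant for r in reasons), default=Warrant.OPINION)
    return uses_measurable, highest


# Illustrative protocol: one opinion-based reason, one tested reason.
example = [
    Reason("attractiveness", Warrant.OPINION),
    Reason("quantity", Warrant.ACTUAL_TEST),
]
result = score_protocol(example)
```

Here the example protocol is scored as using a measurable dimension and as reaching the "actual test" level, since one reason on the quantity dimension was tested even though another was mere opinion.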

Four graduate students in mathematics and science education were trained
for a full day in the use of the scoring categories with the help of sample
protocols. Inter-judge reliability was assessed through the intra-class
correlation. The coefficients yielded at the start of training were +.67 and
+.71 for the measurement dimension and the level of warrant respectively. At
the end of a day of training, the corresponding coefficients were +.87 and +.89.
In the final scoring of the protocols, the pretests and posttests were
intermixed, and the protocol pool was then randomly divided among the four raters.
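
For readers unfamiliar with the intra-class correlation, the following Python sketch computes one common form of it, the one-way random-effects coefficient for single ratings. The report does not say which form of the coefficient was used, so this choice is an assumption, and the ratings shown are invented for illustration rather than taken from the study.

```python
def icc_oneway(ratings):
    """One-way random-effects intraclass correlation for single ratings.

    `ratings` holds one row per protocol and one column per rater.
    """
    n = len(ratings)      # number of protocols
    k = len(ratings[0])   # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Mean square between protocols.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    # Mean square within protocols (disagreement among raters).
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical warrant-level scores (1 to 3) from four raters on five protocols.
sample_ratings = [
    [1, 1, 1, 2],
    [3, 3, 3, 3],
    [2, 2, 1, 2],
    [1, 1, 1, 1],
    [3, 2, 3, 3],
]
reliability = icc_oneway(sample_ratings)
```

The coefficient approaches +1 as raters agree more closely on each protocol relative to the spread between protocols. The function does not guard against the degenerate case in which every rating is identical, where the denominator is zero.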

(ii) Data Analysis

In terms of whether or not any of the subject's reasons for selection of 
a particular notebook were stated along dimensions that were measurable within 
the test situation, the summary data by District areas are presented in Tables 
1 and 2 for the USMES and Control classes respectively. The pattern in these 
tables is quite clear. At the beginning of the school year, the pretest data
indicate that in all districts and in both USMES and Control classes, only 
a small minority of the pupils state reasons for their notebook selection in 
terms of dimensions that are directly measurable within the test situation. 
At the end of the year, however, the posttest data indicate that for the USMES 
group there has been a considerable change. At this later juncture, the USMES
pupils are virtually all responding in terms of measurable dimensions with only
eight of one hundred and forty-four USMES students having protocols whose
response rationales all fall into the non-measurable category. As indicated
in Table 1, Chi Square contingency tests indicate a statistically significant

Table 1 about here

relationship between test-time (i.e., pretest vis-à-vis posttest) and category
response (i.e., measurable vis-à-vis non-measurable reason for notebook selection)
in all USMES districts. As described above, in each case, this shift is from
the use of non-measurable dimensions to the use of measurable ones.
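
A contingency test of this kind can be sketched in Python as follows. The function computes the Pearson chi-square statistic for a table of counts, without a continuity correction; whether the original analysis applied one is not stated. In the example, the posttest row (136 measurable, 8 non-measurable) follows from the figures reported above, but the pretest split of the 132 USMES pretests is hypothetical, since the per-category pretest counts appear only in the tables.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of test-time and category.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Rows: pretest, posttest. Columns: measurable, non-measurable.
usmes_counts = [
    [20, 112],  # hypothetical pretest split of 132 pupils
    [136, 8],   # posttest: 8 of 144 pupils gave only non-measurable reasons
]
stat, dof = chi_square(usmes_counts)

# Critical value of chi-square at the .05 level with 1 degree of freedom.
significant = stat > 3.841
```

The same function accepts the two-by-three tables that arise for level of warrant (personal opinion, suggested test, actual test), provided no column total is zero; a category that no pupil used at either test-time would have to be dropped before the statistic is computed.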

The posttest situation with the control classes, however, presents a 
startling contrast. The great majority of these subjects (cf. Table 2) con- 
tinued as in the pretest to offer rationales stated in terms of non-measurable 
dimensions. Thus, in all eight District/School areas for which complete control
group data were available, the Chi Square contingency tests yielded no

Table 2 about here

statistically significant relationship between test-time and response
category.

The individual classroom data for both pretests and posttests along 
the measurement/non-measurement dimension are given in Appendix A to this 
report. Examination of these data for both the USMES classes (Appendix A,
Table 1) and the Control classes (Appendix A, Table 2) consistently
substantiates by classroom unit the results described above for the District areas.
Although the small N's per individual class made statistical tests
inappropriate, it is apparent that as with the summary District areas, there was
in the USMES groups a shift over the treatment period from the use of
non-measurable dimensions to the use of measurable ones while over the same period,
there was no comparable shift in the Control groups. The extreme consistency
of this pattern appeared to make special tests for grade levels, USMES unit
groups, teacher experience, etc., unnecessary.

In terms of the Level of Warrant associated with the subject's reasons 
for selection of a particular notebook, the summary data by District areas 
are presented in Tables 3 and 4 for the USMES and Control classes respectively.
As with the previous data, the pattern in these tables is quite clear. At 
the start of the school year, the pretests indicate that in all districts and 
in both USMES and Control classes, the great majority of students rationalized 
their notebook selections solely in terms of personal opinion. Some few did 
suggest a test of their hypotheses, but virtually none actually tested their
rationales in the test situation. At the end of the school year, however, the 
posttests show that for the USMES group there was a marked change in test
behavior. At the end of the school year, after having worked with the various
USMES materials, the USMES classes are using higher levels of warrant. Almost
all of the USMES pupils are either suggesting tests that would assess the
validity of their notebook selection or actually performing the test within
the problem solving situation. Only twelve of the one hundred and forty-four
USMES students tested at posttest time offered personal opinion as a warrant
for their response. As indicated in Table 3, Chi Square contingency tests
indicate a statistically significant relationship between test-time (i.e., 

Table 3 about here 

pretest vis-à-vis posttest) and category response (i.e., personal opinion,
suggested test, actual test) in all USMES districts. As described above, in
each case this relationship indicates a shift from personal opinion to
suggested and actual testing as warrants for recommended action.

In the control classes, however, there appeared no striking shift from
the use of personal opinion as a warrant for notebook selection. As indicated
in Table 4, the great majority of the control subjects continued as in the

Table 4 about here 

pretest situation to rely on personal opinion. Only sixteen of one hundred 
and six pupils suggested a test to validate their selection and none actually
performed a test in the problem solving setting. Thus, in all eight District
areas for which complete control group data were available, the Chi Square
contingency tests yielded no statistically significant relationship between
test-time and frequency of use of the various levels of warrant.

The specific classroom data for both pretests and posttests along the 
level of warrant dimension are given in Appendix B to this report. Examination
of these data for both the USMES classes (Appendix B, Table 1) and the Control
classes (Appendix B, Table 2) consistently substantiates by individual classroom
unit the results as described above for the summary areas. Although as with the
measurable/non-measurable dimension, the small N's per individual class made
statistical tests inappropriate, it is obvious that as with the larger areas,
there was in the USMES groups a change from the use of personal opinion as a
warrant to the use of suggested and actual tests. Over the same period,
however, there was no comparable shift in the control groups. The basic
consistency of this pattern made further comparisons by grade level, USMES unit,
etc., inconsequential.

Conclusion

The present aspect of the 1972-73 USMES evaluation was undertaken in an 
effort to observe the problem solving behavior of elementary school children 
(both USMES and controls) in a situation which was standardized and structured 
but which provided the subjects an opportunity to consider and test hypotheses 
with concrete materials. The dependent variables of concern were (a) the 
use of rationales stated in terms of dimensions that were measurable within
the problem solving situation, and (b) the level of warrant associated with 
solutions in the same problem solving situation. 

The problem solving behavior of the children was observed at both the 
beginning and the end of the school year, the treatment being the use of one or 
other of the USMES units between these two occasions. Data analyses indicated 
that both USMES and control groups began the school year (a) by using non- 
measurement dimensions in their problem solving and (b) by relying on personal 
opinion as the warrant for the validity of their solutions. At the end of 
the school year, however, the USMES pupils were relying primarily on
measurement dimensions and using suggested and actual tests to validate their work.
The control pupils on the other hand continued to exhibit the pattern of
behavior of the pretest situation. Thus, it would appear that in terms of the two
dependent variables studied, the USMES experience had, irrespective of units
and teachers involved, a marked and positive effect on the students' problem 
solving behavior. 



Caveat 
The present effort was intended to be an exploratory "first step" in
the development of a series of problem solving tasks appropriate to the
evaluation of USMES-type programs. It suffered from some of the difficulties of
first stage developments. The directions given to the testers were not
adequately specific so that there was considerable variation in the style of
administration. For example, it was unclear to the testers just how much
time was "adequate" or how to judge when a testee had finished. Further, there
is some difficulty over the partially "ex post facto" nature of the rating
categories, and these would certainly need to be cross validated in future
work. Finally, it must be pointed out that the control classes cannot be
considered comparable to the USMES groups in the strict sense of the term in
that very relaxed standards were used in their selection although efforts were
made to choose groups in the same school and at the same grade level.

The redeeming aspect to these difficulties lies in the clarity and
consistency of the actual results. Nevertheless, future work in this area should
take steps to eliminate or at least reduce the confounding effects of these
variations.
APPENDIX A

Classroom Data on Reasons for Selection 
Measurable vs. Non-Measurable 
APPENDIX B



Classroom Data on Level of Warrant 

(i) Opinion 
(ii) Suggested Test 
(iii) Actual Test 
Original Response Categories

I. MEASURABLE (during test situation)

a. size-volume

b. weight

c. quantity

d. cost

II. TESTABLE (could be studied, examined in future)

a. versatility

b. construction (durability, lack of defects)

c. manageability

d. health

e. specific utility

III. QUALITATIVE (general statements of opinion)

a. attractiveness

b. appeal to:

tradition
authority
peers

personal preference
nationalism (made in USA)

c. prestige

d. uniqueness